Optimization: stream HTTP responses from rabbit_prometheus_handler
#14885
base: main
Conversation
Ah welp, when measured this doesn't look as promising as I thought. I have three EC2 instances: one acting as the scraper, and the other two, "galactica" and "kestrel", running single-instance brokers with the 100k-classic-queues.json definition import. The scraping node runs this script to scrape from each node every 2 seconds:

```bash
#! /usr/bin/env bash
N=600
SLEEP=2
for i in $(seq 1 $N)
do
    echo "Sleeping ${SLEEP}s... ($i / $N)"
    sleep $SLEEP
    echo "Ask for metrics from $1... ($i / $N)"
    curl -s "http://$1:15692/metrics/per-object" --output /dev/null &
done
wait
```

I swapped which node was running which branch, but we can see that this branch consistently has more EC2 instance-wide memory usage rather than less! Galactica:
Kestrel:
In the first test (01:03 - 01:23) Galactica runs ... So it looks like this branch is worse for memory usage as-is. I will have to do a bit more digging. It seems like passing the iodata to the Cowboy process might be creating more garbage than writing the data to the ram_file port. We might be able to buffer some of the iodata in the callback, or restructure things in the prometheus dep, to improve memory usage.
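As a minimal sketch of the buffering idea, assuming a chunk-callback interface: accumulate small iodata pieces and only hand them to the writer once a threshold is reached, so far fewer messages hit the Cowboy process. The module name, the 64 KiB flush threshold, and the fold-style driver standing in for the `prometheus` formatter are illustrative assumptions, not the actual patch; in the real handler the `Emit` fun would call `cowboy_req:stream_body/3`.

```erlang
-module(buffered_emit_example).
-export([demo/0]).

-define(FLUSH_BYTES, 65536).

%% Feed one chunk into the accumulator; flush to Emit once the threshold is hit.
add_chunk(Emit, Chunk, {Acc, Size}) ->
    NewSize = Size + iolist_size(Chunk),
    case NewSize >= ?FLUSH_BYTES of
        true  -> Emit([Acc, Chunk]), {[], 0};
        false -> {[Acc, Chunk], NewSize}
    end.

%% Flush whatever is left once the registry walk is done.
finish(Emit, {Acc, _Size}) ->
    Emit(Acc).

demo() ->
    %% Stand-in for cowboy_req:stream_body(IoData, nofin, Req) in the handler.
    Emit = fun(IoData) -> io:format("flush ~p bytes~n", [iolist_size(IoData)]) end,
    %% Pretend the formatter emits 10000 small metric lines.
    Chunks = lists:duplicate(10000, <<"metric_a 1\n">>),
    State = lists:foldl(fun(C, S) -> add_chunk(Emit, C, S) end, {[], 0}, Chunks),
    finish(Emit, State).
```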
How big does the counter get? You may have increased the number of messages to Cowboy drastically, and each message has a cost (data gets processed and buffered at various steps of the sending process).
It's actually not quite as big as I thought: I see 147_161_056 bytes sent in 557 calls to ... (roughly 264 kB per call on average).
Setting ...
So now we have sound double digit % improvements for both CPU and memory footprint. Awesome!
The peak memory footprint improvement is really at the edge of "double digits" if I'm being honest: it's right around 10%, and these instances really have more like 15 GB of memory 😅. I thought we would see great peak memory usage improvements here, but it's actually the CPU savings that stand out. Reducing the work the GC needs to do seems to pay off. Looking at the ...
Thanks for looking into this, the improvements are already significant. Your observations about prometheus.erl concur with what I saw when switching to ra_counters. I even implemented a simple prometheus-format exporter and it was already better. However, I ultimately didn't use it since we would lose some functionality (e.g. ...).
You can stream gzip responses too by using ...
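The specific mechanism referred to above is cut off in the captured text. As one possible approach (an assumption on my part, not necessarily what was meant), OTP's built-in `zlib` module can gzip-compress chunks incrementally, so each compressed piece can be passed to `cowboy_req:stream_body/3` as it is produced:

```erlang
-module(gzip_stream_example).
-export([demo/0]).

demo() ->
    Z = zlib:open(),
    %% WindowBits 31 (15 + 16) selects the gzip wrapper rather than raw deflate.
    ok = zlib:deflateInit(Z, default, deflated, 31, 8, default),
    Chunks = [<<"metric_a 1\n">>, <<"metric_b 2\n">>, <<"metric_c 3\n">>],
    Compressed =
        [zlib:deflate(Z, C) || C <- Chunks] ++ [zlib:deflate(Z, <<>>, finish)],
    ok = zlib:deflateEnd(Z),
    zlib:close(Z),
    %% In a handler, each compressed piece would be sent with
    %% cowboy_req:stream_body(Piece, nofin, Req) under a
    %% "content-encoding: gzip" response header; here we just report sizes.
    io:format("compressed ~p -> ~p bytes~n",
              [iolist_size(Chunks), iolist_size(Compressed)]).
```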
`prometheus_text_format:format/1` produces a binary of the format for the entire registry. For clusters with many resources, this can lead to large replies from `/metrics/[:registry]`, especially for large registries like `per-object`. Instead of formatting the response and then sending it, we can stream the response by taking advantage of the new `format_into/3` callback (which needs to be added upstream to the `prometheus` dep). This uses `cowboy_req:stream_body/3` to stream the iodata as `prometheus` works through the registry.

This should hopefully be a nice memory improvement. The other benefit is that results are sent eagerly. For a stress-testing example:

1. `make run-broker`
2. `rabbitmqctl import_definitions path/to/100k-classic-queues.json`
3. `curl -s localhost:15692/metrics/per-object`

Before this change `curl` would wait for around 8 seconds and then the entire response would arrive. With this change the results start streaming in immediately.

Discussed in #14865
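For reference, here is a minimal sketch of the handler side of this idea, not the exact diff in this PR: open the response with `cowboy_req:stream_reply/3` and push each chunk with `cowboy_req:stream_body/3` as the registry is walked. The exact shape of the proposed `format_into/3` callback is an assumption.

```erlang
-module(streaming_handler_sketch).
-export([handle_metrics/2]).

%% Stream the Prometheus text exposition instead of building one big binary.
handle_metrics(Req0, Registry) ->
    Req = cowboy_req:stream_reply(
            200,
            #{<<"content-type">> => <<"text/plain; version=0.0.4">>},
            Req0),
    %% Assumed callback contract: invoked with each chunk of formatted iodata
    %% plus an accumulator as the formatter works through the registry.
    Emit = fun(IoData, Acc) ->
                   ok = cowboy_req:stream_body(IoData, nofin, Req),
                   Acc
           end,
    _ = prometheus_text_format:format_into(Emit, ok, Registry),
    ok = cowboy_req:stream_body(<<>>, fin, Req),
    {ok, Req, Registry}.
```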
Draft as I would like to collect some memory-usage metrics before and after the change...